HIV Viral Load Prediction

Introduction

Premise: Prognostic prediction has proven empirically to be a highly effective paradigm, one that is reshaping public health, clinical medicine, and healthcare more broadly. Despite these benefits, far too little has been done to apply it to the current WHO viral-load-informed HIV care model. At present, estimating the risk of virologic failure is typically left to the clinician’s discretion and rests heavily on the provider’s opinion. Other complications compound the problem and often delay essential interventions, particularly in resource-limited settings, which are commonly characterized by an acute shortage of healthcare workers.

Objective: To achieve the maximum health impact of the current WHO model, timely detection of potential virologic failure is critical for preventing adverse clinical trajectories such as treatment failure and immunological deterioration. This project therefore aims to assess modern techniques that can be used in HIV clinical settings with rich EHR data to anticipate and mitigate the risk of virologic failure before it manifests.

Methods: A series of statistical learning models, comprising parametric, non-parametric, ensemble, and Bayesian approaches, will be trained and evaluated using a dataset extracted from IeDEA (https://www.iedea.org/). IeDEA is an international research consortium established in 2006 by the National Institute of Allergy and Infectious Diseases to provide a rich resource for globally diverse HIV/AIDS data. Cross-validated metrics such as sensitivity and specificity will be used to evaluate how well each model distinguishes between low-risk and high-risk patients.

Data source: The IeDEA Cohort Consortium hosts deidentified data on 1.7 million HIV/AIDS patients. Data are collected from seven international regions: four in Africa and one each in the Asia-Pacific region, the Central/South America/Caribbean region, and North America. Each region has data centers that consolidate, curate, and analyze data.

Preliminary Analysis

Data Exploration

In this section we explore the data before using it to build three classical binary classification models that predict whether or not a patient is virally suppressed.

Import Wrangled Data

This is a one-row-per-patient dataset created in the previous section, with the following features. Note that not all of these features will be used for prediction; only baseline features will be used.

Viral Load Flow Chart

First Viral Load

Out of the 30,063 patients, only 10,270 had a baseline viral load. Of these 10,270, 78% were virally suppressed.

  patients_missing_vl  patients_with_vl  suppressed  not_suppressed  suppression_prevalence
                19393             10180        8002            2178               0.7860511

Second Viral Load

  patients_missing_vl  patients_with_vl  suppressed  not_suppressed  suppression_prevalence
                23500              6073        4900            1173                 0.80685

Third Viral Load

  patients_missing_vl  patients_with_vl  suppressed  not_suppressed  suppression_prevalence
                26286              3287        2638             649               0.8025555

Fourth Viral Load

  patients_missing_vl  patients_with_vl  suppressed  not_suppressed  suppression_prevalence
                28198              1375        1094             281               0.7956364

Fifth Viral Load

  patients_missing_vl  patients_with_vl  suppressed  not_suppressed  suppression_prevalence
                29210               363         256             107               0.7052342
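The derived columns in the five tables above can be recomputed from the raw counts. The following is a minimal Python sketch, given for illustration only; the names `VL_ROUNDS`, `suppression_prevalence`, and `summarise` are hypothetical and do not come from the original analysis. It also shows that the missing and non-missing counts in every table add up to the same 29,573 patients.

```python
# Counts copied from the five viral load tables above:
# (label, patients_missing_vl, suppressed, not_suppressed)
VL_ROUNDS = [
    ("first", 19393, 8002, 2178),
    ("second", 23500, 4900, 1173),
    ("third", 26286, 2638, 649),
    ("fourth", 28198, 1094, 281),
    ("fifth", 29210, 256, 107),
]


def suppression_prevalence(suppressed, not_suppressed):
    """Share of patients with a result who are virally suppressed."""
    return suppressed / (suppressed + not_suppressed)


def summarise(rounds):
    """Return one dict per round with derived totals and prevalence."""
    rows = []
    for label, missing, supp, not_supp in rounds:
        with_vl = supp + not_supp
        rows.append({
            "round": label,
            "patients_with_vl": with_vl,
            "cohort_size": missing + with_vl,
            "suppression_prevalence": suppression_prevalence(supp, not_supp),
        })
    return rows


for row in summarise(VL_ROUNDS):
    print(row["round"], row["patients_with_vl"], row["cohort_size"],
          round(row["suppression_prevalence"], 4))
```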

Covariates

type variable missing complete n n_unique top_counts ordered mean sd p0 p25 p50 p75 p100 hist
factor abdominal_pexam 0 21260 21260 3 Nor: 13417, Ukn: 7666, Abn: 177, NA: 0 FALSE NA NA NA NA NA NA NA NA
factor arv_change_reason 0 21260 21260 18 N/A: 20560, 829: 182, 193: 145, 102: 114 FALSE NA NA NA NA NA NA NA NA
factor arv_elgibility_reason 0 21260 21260 19 562: 14140, 950: 3879, 120: 671, 177: 600 FALSE NA NA NA NA NA NA NA NA
factor bmi_status 0 21260 21260 5 Nor: 11616, Und: 4251, Ove: 2646, Obe: 1603 FALSE NA NA NA NA NA NA NA NA
factor cardiac_pexam 0 21260 21260 3 Nor: 15670, Ukn: 5554, Abn: 36, NA: 0 FALSE NA NA NA NA NA NA NA NA
factor contraceptive 0 21260 21260 25 110: 11127, 190: 6475, 527: 950, 907: 634 FALSE NA NA NA NA NA NA NA NA
factor cryptococcus_tx 0 21260 21260 2 110: 21137, 747: 123, NA: 0 FALSE NA NA NA NA NA NA NA NA
factor cur_pcp_prophylaxis 0 21260 21260 3 916: 15092, 110: 6025, 92: 143, NA: 0 FALSE NA NA NA NA NA NA NA NA
factor cxr_code_labs 0 21260 21260 3 Ukn: 19998, Nor: 788, Abn: 474, NA: 0 FALSE NA NA NA NA NA NA NA NA
factor drug_toxicity_cause 0 21260 21260 13 110: 21121, 512: 45, 562: 36, 3: 12 FALSE NA NA NA NA NA NA NA NA
factor drug_toxicity_effects 0 21260 21260 26 110: 21103, 512: 52, 877: 16, 562: 14 FALSE NA NA NA NA NA NA NA NA
factor drug_toxicity_severity 0 21260 21260 4 110: 21132, 174: 63, 174: 40, 174: 25 FALSE NA NA NA NA NA NA NA NA
factor extremies_pexam 0 21260 21260 3 Nor: 13488, Ukn: 7643, Abn: 129, NA: 0 FALSE NA NA NA NA NA NA NA NA
factor family_tx_support 0 21260 21260 26 110: 18341, 140: 707, 727: 580, 727: 567 FALSE NA NA NA NA NA NA NA NA
factor first_arv_adherence 0 21260 21260 4 Unk: 19026, GOO: 2151, POO: 50, FAI: 33 FALSE NA NA NA NA NA NA NA NA
factor first_arv_meds 0 21260 21260 84 0: 10767, 696: 7091, 106: 938, 106: 767 FALSE NA NA NA NA NA NA NA NA
factor first_location 0 21260 21260 82 loc: 2348, loc: 2228, loc: 1835, loc: 1147 FALSE NA NA NA NA NA NA NA NA
factor general_pexam 0 21260 21260 3 Ukn: 16184, Nor: 3330, Abn: 1746, NA: 0 FALSE NA NA NA NA NA NA NA NA
factor health_cover 0 21260 21260 7 110: 16230, NHI: 2725, 106: 1342, 562: 943 FALSE NA NA NA NA NA NA NA NA
factor heent_pexam 0 21260 21260 3 Nor: 13349, Ukn: 7648, Abn: 263, NA: 0 FALSE NA NA NA NA NA NA NA NA
factor hospitalization_loc 0 21260 21260 4 N/A: 21114, 127: 71, 127: 51, 562: 24 FALSE NA NA NA NA NA NA NA NA
factor hospitalization_rsn 0 21260 21260 103 N/A: 20889, 123: 80, 43: 29, 197: 23 FALSE NA NA NA NA NA NA NA NA
factor immunization_status 0 21260 21260 5 N/A: 21117, 106: 79, 562: 53, 106: 9 FALSE NA NA NA NA NA NA NA NA
factor last_arv_adherence 0 21260 21260 4 Unk: 10823, GOO: 10195, POO: 122, FAI: 120 FALSE NA NA NA NA NA NA NA NA
factor last_arv_meds 0 21260 21260 96 696: 11977, 0: 6461, 646: 587, 631: 431 FALSE NA NA NA NA NA NA NA NA
factor last_location 0 21260 21260 80 loc: 2333, loc: 2244, loc: 1998, loc: 1133 FALSE NA NA NA NA NA NA NA NA
factor lymph_nodes_pexam 0 21260 21260 3 Nor: 13210, Ukn: 7715, Abn: 335, NA: 0 FALSE NA NA NA NA NA NA NA NA
factor musculoskeletal_pexam 0 21260 21260 3 Nor: 13511, Ukn: 7681, Abn: 68, NA: 0 FALSE NA NA NA NA NA NA NA NA
factor neurologic_pexam 0 21260 21260 3 Nor: 13355, Ukn: 7868, Abn: 37, NA: 0 FALSE NA NA NA NA NA NA NA NA
factor not_onart_reason 0 21260 21260 5 N/A: 20375, 143: 584, 562: 184, 548: 80 FALSE NA NA NA NA NA NA NA NA
factor nutrition_status 0 21260 21260 6 N/A: 17371, 111: 3235, 947: 278, 689: 170 FALSE NA NA NA NA NA NA NA NA
factor pcp_change_reason 0 21260 21260 5 N/A: 21159, 102: 69, 562: 29, 704: 2 FALSE NA NA NA NA NA NA NA NA
factor pcp_prophy_adherence 0 21260 21260 9 634: 9927, N/A: 6865, 116: 4157, 665: 135 FALSE NA NA NA NA NA NA NA NA
factor phdp_referral 0 21260 21260 10 110: 11397, 548: 7617, 117: 1321, 830: 504 FALSE NA NA NA NA NA NA NA NA
factor point_of_hiv_daignosis 0 21260 21260 13 562: 14658, 217: 3847, 204: 1508, 562: 282 FALSE NA NA NA NA NA NA NA NA
factor poor_adherence_rsn 0 21260 21260 18 110: 20251, 164: 218, 610: 201, 562: 150 FALSE NA NA NA NA NA NA NA NA
factor psychiatric_pexam 0 21260 21260 3 Nor: 13426, Ukn: 7779, Abn: 55, NA: 0 FALSE NA NA NA NA NA NA NA NA
factor pulse_status 0 21260 21260 3 Nor: 17204, hig: 3032, low: 1024, NA: 0 FALSE NA NA NA NA NA NA NA NA
factor referral_ordered 0 21260 21260 30 110: 12109, 548: 2829, 548: 2278, 158: 863 FALSE NA NA NA NA NA NA NA NA
factor respiratory_pexam 0 21260 21260 3 Nor: 13088, Ukn: 7598, Abn: 574, NA: 0 FALSE NA NA NA NA NA NA NA NA
factor skin_pexam 0 21260 21260 3 Nor: 12526, Ukn: 7653, Abn: 1081, NA: 0 FALSE NA NA NA NA NA NA NA NA
factor sti_symptoms 0 21260 21260 20 110: 20069, 599: 204, 620: 171, 623: 133 FALSE NA NA NA NA NA NA NA NA
factor sulf_peni_other_reactions 0 21260 21260 8 110: 21171, 512: 38, 562: 23, 879: 15 FALSE NA NA NA NA NA NA NA NA
factor tb_assmt_status 0 21260 21260 4 110: 20033, 697: 746, 617: 415, 111: 66 FALSE NA NA NA NA NA NA NA NA
factor tb_prop_change_rsn 0 21260 21260 6 N/A: 20307, 126: 879, 562: 33, 102: 31 FALSE NA NA NA NA NA NA NA NA
factor tb_prophy_regimen 0 21260 21260 2 110: 17017, 656: 4243, NA: 0 FALSE NA NA NA NA NA NA NA NA
factor tb_symptoms 0 21260 21260 15 110: 19443, 617: 908, 136: 235, 596: 199 FALSE NA NA NA NA NA NA NA NA
factor tb_tx_change_rsn 0 21260 21260 4 N/A: 21114, 126: 119, 562: 16, 102: 11 FALSE NA NA NA NA NA NA NA NA
factor tb_tx_phase 0 21260 21260 5 110: 20730, 619: 373, 619: 154, 619: 2 FALSE NA NA NA NA NA NA NA NA
factor tb_tx_regimen 0 21260 21260 14 110: 20048, 113: 601, 106: 517, 119: 68 FALSE NA NA NA NA NA NA NA NA
factor tb_tx_restart_rsn 0 21260 21260 6 N/A: 21069, 697: 166, 697: 15, 698: 8 FALSE NA NA NA NA NA NA NA NA
factor toxic_drug 0 21260 21260 12 110: 21200, 633: 21, 916: 14, 656: 8 FALSE NA NA NA NA NA NA NA NA
factor urogenital_pexam 0 21260 21260 3 Nor: 11376, Ukn: 5609, Abn: 4275, NA: 0 FALSE NA NA NA NA NA NA NA NA
factor vl_1_date 0 21260 21260 733 emp: 19744, 201: 9, 201: 9, 201: 8 FALSE NA NA NA NA NA NA NA NA
integer adherence_changes 0 21260 21260 NA NA NA 0.051 0.35 0 0 0 0 9 ▇▁▁▁▁▁▁▁
integer alcohol_consumer 0 21260 21260 NA NA NA 0.14 0.34 0 0 0 0 1 ▇▁▁▁▁▁▁▁
integer arv_lines_changed 0 21260 21260 NA NA NA 0.007 0.1 0 0 0 0 5 ▇▁▁▁▁▁▁▁
integer arv_meds_changed 0 21260 21260 NA NA NA 0.15 0.48 0 0 0 0 7 ▇▁▁▁▁▁▁▁
integer changed_location 0 21260 21260 NA NA NA 0.2 0.76 0 0 0 0 11 ▇▁▁▁▁▁▁▁
integer changed_who_stages 0 21260 21260 NA NA NA 0.1 0.47 0 0 0 0 7 ▇▁▁▁▁▁▁▁
integer cig_smoker 0 21260 21260 NA NA NA 0.051 0.22 0 0 0 0 1 ▇▁▁▁▁▁▁▁
integer clinical_problem_rptd 0 21260 21260 NA NA NA 0.13 0.33 0 0 0 0 1 ▇▁▁▁▁▁▁▁
integer crag_labs 0 21260 21260 NA NA NA 664 0 664 664 664 664 664 ▁▁▁▇▁▁▁▁
integer cur_on_other_meds 0 21260 21260 NA NA NA 0.32 0.47 0 0 0 1 1 ▇▁▁▁▁▁▁▃
integer days_b4_next_vl 0 21260 21260 NA NA NA 70.06 2501.88 0 0 1 57 364401 ▇▁▁▁▁▁▁▁
integer days_btwn_apptmts 0 21260 21260 NA NA NA 0.18 24.68 0 0 0 0 3597 ▇▁▁▁▁▁▁▁
integer facility_volume 0 21260 21260 NA NA NA 7402.74 4519.18 1 4805 6211 9833 16564 ▃▁▇▃▅▁▂▂
integer first_age 0 21260 21260 NA NA NA 38.12 11.02 18 30 36 44 103 ▃▇▅▂▁▁▁▁
integer first_arv_line 0 21260 21260 NA NA NA 0.47 0.57 0 0 0 1 3 ▇▁▆▁▁▁▁▁
integer first_days_pregnant 0 21260 21260 NA NA NA 0 0 0 0 0 0 0 ▁▁▁▇▁▁▁▁
integer first_pcs 0 21260 21260 NA NA NA 6076.1 337.27 1286 6101 6101 6101 6101 ▁▁▁▁▁▁▁▇
integer first_who_stage 0 21260 21260 NA NA NA 0.62 1 0 0 0 1 4 ▇▂▁▁▁▁▁▁
integer has_abnormal_oxy_sat 0 21260 21260 NA NA NA 0.022 0.15 0 0 0 0 1 ▇▁▁▁▁▁▁▁
integer has_been_hospitalized 0 21260 21260 NA NA NA 0.017 0.13 0 0 0 0 1 ▇▁▁▁▁▁▁▁
integer has_changed_pcp 0 21260 21260 NA NA NA 0.0048 0.069 0 0 0 0 1 ▇▁▁▁▁▁▁▁
integer has_changed_tb_prop 0 21260 21260 NA NA NA 0.045 0.21 0 0 0 0 1 ▇▁▁▁▁▁▁▁
integer has_changed_tb_tx 0 21260 21260 NA NA NA 0.0069 0.083 0 0 0 0 1 ▇▁▁▁▁▁▁▁
integer has_drug_tox_efcts 0 21260 21260 NA NA NA 0.0074 0.086 0 0 0 0 1 ▇▁▁▁▁▁▁▁
integer has_fever 0 21260 21260 NA NA NA 0.015 0.12 0 0 0 0 1 ▇▁▁▁▁▁▁▁
integer has_heptis_b 0 21260 21260 NA NA NA 0.00028 0.017 0 0 0 0 1 ▇▁▁▁▁▁▁▁
integer has_high_bp 0 21260 21260 NA NA NA 0.36 0.48 0 0 0 1 1 ▇▁▁▁▁▁▁▅
integer has_low_bp 0 21260 21260 NA NA NA 0.024 0.15 0 0 0 0 1 ▇▁▁▁▁▁▁▁
integer has_phdp_referral 0 21260 21260 NA NA NA 0.46 0.5 0 0 0 1 1 ▇▁▁▁▁▁▁▇
integer has_referral_order 0 21260 21260 NA NA NA 0.43 0.5 0 0 0 1 1 ▇▁▁▁▁▁▁▆
integer has_restarted_tb_tx 0 21260 21260 NA NA NA 0.009 0.094 0 0 0 0 1 ▇▁▁▁▁▁▁▁
integer has_sti_symptoms 0 21260 21260 NA NA NA 0.056 0.23 0 0 0 0 1 ▇▁▁▁▁▁▁▁
integer has_sulf_peni_rxns 0 21260 21260 NA NA NA 0.0042 0.065 0 0 0 0 1 ▇▁▁▁▁▁▁▁
integer has_tb_symptoms 0 21260 21260 NA NA NA 0.085 0.28 0 0 0 0 1 ▇▁▁▁▁▁▁▁
integer has_toxic_drug 0 21260 21260 NA NA NA 0.0028 0.053 0 0 0 0 1 ▇▁▁▁▁▁▁▁
integer has_used_contraceptive 0 21260 21260 NA NA NA 0.07 0.25 0 0 0 0 1 ▇▁▁▁▁▁▁▁
integer having_drug_toxicity 0 21260 21260 NA NA NA 0.0095 0.097 0 0 0 0 1 ▇▁▁▁▁▁▁▁
integer hospitalized_recently 0 21260 21260 NA NA NA 0.56 0.5 0 0 1 1 1 ▆▁▁▁▁▁▁▇
integer is_abdominal_pexam 0 21260 21260 NA NA NA 0.0083 0.091 0 0 0 0 1 ▇▁▁▁▁▁▁▁
integer is_breastfeeding 0 21260 21260 NA NA NA 0.014 0.12 0 0 0 0 1 ▇▁▁▁▁▁▁▁
integer is_cardiac_pexam 0 21260 21260 NA NA NA 0.0017 0.041 0 0 0 0 1 ▇▁▁▁▁▁▁▁
integer is_cxr_code_labs 0 21260 21260 NA NA NA 0.022 0.15 0 0 0 0 1 ▇▁▁▁▁▁▁▁
integer is_extremies_pexam 0 21260 21260 NA NA NA 0.0061 0.078 0 0 0 0 1 ▇▁▁▁▁▁▁▁
integer is_general_pexam 0 21260 21260 NA NA NA 0.082 0.27 0 0 0 0 1 ▇▁▁▁▁▁▁▁
integer is_heent_pexam 0 21260 21260 NA NA NA 0.012 0.11 0 0 0 0 1 ▇▁▁▁▁▁▁▁
integer is_lymph_nodes_pexam 0 21260 21260 NA NA NA 0.016 0.12 0 0 0 0 1 ▇▁▁▁▁▁▁▁
integer is_male 0 21260 21260 NA NA NA 0.32 0.47 0 0 0 1 1 ▇▁▁▁▁▁▁▃
integer is_musculoskeletal_pexam 0 21260 21260 NA NA NA 0.0032 0.056 0 0 0 0 1 ▇▁▁▁▁▁▁▁
integer is_neurologic_pexam 0 21260 21260 NA NA NA 0.0017 0.042 0 0 0 0 1 ▇▁▁▁▁▁▁▁
integer is_on_contraceptive 0 21260 21260 NA NA NA 0.48 0.5 0 0 0 1 1 ▇▁▁▁▁▁▁▇
integer is_on_cryptococcus_tx 0 21260 21260 NA NA NA 0.0058 0.076 0 0 0 0 1 ▇▁▁▁▁▁▁▁
integer is_on_health_cover 0 21260 21260 NA NA NA 0.24 0.43 0 0 0 0 1 ▇▁▁▁▁▁▁▂
integer is_on_tb_prophy_regimen 0 21260 21260 NA NA NA 0.2 0.4 0 0 0 0 1 ▇▁▁▁▁▁▁▂
integer is_psychiatric_pexam 0 21260 21260 NA NA NA 0.0026 0.051 0 0 0 0 1 ▇▁▁▁▁▁▁▁
integer is_respiratory_pexam 0 21260 21260 NA NA NA 0.027 0.16 0 0 0 0 1 ▇▁▁▁▁▁▁▁
integer is_skin_pexam 0 21260 21260 NA NA NA 0.051 0.22 0 0 0 0 1 ▇▁▁▁▁▁▁▁
integer is_status_disclosed 0 21260 21260 NA NA NA 0.1 0.3 0 0 0 0 1 ▇▁▁▁▁▁▁▁
integer is_symptomatic 0 21260 21260 NA NA NA 0.018 0.13 0 0 0 0 1 ▇▁▁▁▁▁▁▁
integer is_underweight 0 21260 21260 NA NA NA 0.2 0.4 0 0 0 0 1 ▇▁▁▁▁▁▁▂
integer is_urogenital_pexam 0 21260 21260 NA NA NA 0.2 0.4 0 0 0 0 1 ▇▁▁▁▁▁▁▂
integer last_age 0 21260 21260 NA NA NA 38.12 11.02 18 30 36 44 103 ▃▇▅▂▁▁▁▁
integer last_arv_line 0 21260 21260 NA NA NA 0.71 0.55 0 0 1 1 3 ▃▁▇▁▁▁▁▁
integer last_days_pregnant 0 21260 21260 NA NA NA 6.02 25.71 0 0 0 0 266 ▇▁▁▁▁▁▁▁
integer last_pcs 0 21260 21260 NA NA NA 6003.92 670.13 1286 6101 6101 6101 9068 ▁▁▁▁▇▁▁▁
integer last_who_stage 0 21260 21260 NA NA NA 1.01 1.15 0 0 1 2 4 ▇▆▁▂▁▂▁▁
integer max_days_btwn_apptmts 0 21260 21260 NA NA NA 58.97 112.34 0 0 30 73.25 3597 ▇▁▁▁▁▁▁▁
integer max_days_on_arvs 0 21260 21260 NA NA NA 148.76 355.1 -85 0 15 188 5480 ▇▁▁▁▁▁▁▁
integer max_days_on_treatment 0 21260 21260 NA NA NA 155.52 267.08 0 0 68 210 16398 ▇▁▁▁▁▁▁▁
integer max_days_pregnant 0 21260 21260 NA NA NA 7.58 29.15 0 0 0 0 267 ▇▁▁▁▁▁▁▁
integer max_tb_prop_days 0 21260 21260 NA NA NA 15.06 51.73 0 0 0 0 1184 ▇▁▁▁▁▁▁▁
integer max_tb_tx_days 0 21260 21260 NA NA NA 11.47 69.98 0 0 0 0 3762 ▇▁▁▁▁▁▁▁
integer min_days_btwn_apptmts 0 21260 21260 NA NA NA 0.006 0.59 0 0 0 0 77 ▇▁▁▁▁▁▁▁
integer min_days_on_arvs 0 21260 21260 NA NA NA 46.87 316.09 -85 0 0 1 5480 ▇▁▁▁▁▁▁▁
integer min_days_on_treatment 0 21260 21260 NA NA NA 21.25 193.7 0 0 0 0 16259 ▇▁▁▁▁▁▁▁
integer min_days_pregnant 0 21260 21260 NA NA NA 0 0 0 0 0 0 0 ▁▁▁▇▁▁▁▁
integer min_tb_prop_days 0 21260 21260 NA NA NA 0 0 0 0 0 0 0 ▁▁▁▇▁▁▁▁
integer min_tb_tx_days 0 21260 21260 NA NA NA 0.48 9.78 0 0 0 0 1101 ▇▁▁▁▁▁▁▁
integer needs_fam_tx_support 0 21260 21260 NA NA NA 0.14 0.34 0 0 0 0 1 ▇▁▁▁▁▁▁▁
integer num_bad_adherence 0 21260 21260 NA NA NA 0.047 0.29 0 0 0 0 9 ▇▁▁▁▁▁▁▁
integer num_days_in_care 0 21260 21260 NA NA NA 155.52 267.08 0 0 68 210 16398 ▇▁▁▁▁▁▁▁
integer num_days_on_arvs 0 21260 21260 NA NA NA 148.76 355.1 -85 0 15 188 5480 ▇▁▁▁▁▁▁▁
integer num_days_on_tb_meds 0 21260 21260 NA NA NA 11.47 69.98 0 0 0 0 3762 ▇▁▁▁▁▁▁▁
integer num_days_on_tb_prop 0 21260 21260 NA NA NA 15.06 51.73 0 0 0 0 1184 ▇▁▁▁▁▁▁▁
integer num_defaulted_apptmt 0 21260 21260 NA NA NA 0.21 0.53 0 0 0 0 7 ▇▁▁▁▁▁▁▁
integer num_encounters 0 21260 21260 NA NA NA 4.04 3.62 1 1 3 6 29 ▇▃▁▁▁▁▁▁
integer num_encs_b4_vl1 0 21260 21260 NA NA NA 3.97 3.68 0 1 3 6 29 ▇▃▂▁▁▁▁▁
integer num_pcs_changes 0 21260 21260 NA NA NA 0.029 0.21 0 0 0 0 4 ▇▁▁▁▁▁▁▁
integer other_meds_allergy 0 21260 21260 NA NA NA 0.57 0.5 0 0 1 1 1 ▆▁▁▁▁▁▁▇
integer penicillin_allergy 0 21260 21260 NA NA NA 0.67 0.47 0 0 1 1 1 ▃▁▁▁▁▁▁▇
integer person_id 0 21260 21260 NA NA NA 784602.92 87672.43 55952 767575.75 794491.5 827641.25 861938 ▁▁▁▁▁▁▂▇
integer sulfa_allergy 0 21260 21260 NA NA NA 0.67 0.47 0 0 1 1 1 ▃▁▁▁▁▁▁▇
integer suppressed 0 21260 21260 NA NA NA 0.29 0.45 0 0 0 1 1 ▇▁▁▁▁▁▁▃
integer tb_afb_labs 0 21260 21260 NA NA NA 664.99 39.17 664 664 664 664 2303 ▇▁▁▁▁▁▁▁
integer tb_culture_labs 0 21260 21260 NA NA NA 664 0.27 664 664 664 664 703 ▇▁▁▁▁▁▁▁
integer tb_gene_xp_labs 0 21260 21260 NA NA NA 664.05 1.34 664 664 664 664 703 ▇▁▁▁▁▁▁▁
integer vdrl_labs 0 21260 21260 NA NA NA 666.05 33.87 664 664 664 664 1229 ▇▁▁▁▁▁▁▁
integer vl_count_1 13536 7724 21260 NA NA NA 26920.51 2e+05 0 0 0 581.25 8e+06 ▇▁▁▁▁▁▁▁
integer vl_count_2 16645 4615 21260 NA NA NA 17165.74 175479.66 0 0 0 472 1e+07 ▇▁▁▁▁▁▁▁
integer vl_count_3 18850 2410 21260 NA NA NA 11132.73 88214.64 0 0 0 549.5 3266217 ▇▁▁▁▁▁▁▁
integer vl_count_4 20288 972 21260 NA NA NA 12718.32 67014.09 0 0 55 652.5 987760 ▇▁▁▁▁▁▁▁
integer vl_count_5 21016 244 21260 NA NA NA 20387.95 82791.9 0 0 136 2144.5 869508 ▇▁▁▁▁▁▁▁
integer vl_count_6 21199 61 21260 NA NA NA 24954.18 67061.61 0 0 780 12325 362511 ▇▁▁▁▁▁▁▁
integer vl_count_7 21243 17 21260 NA NA NA 10996.06 36603.48 0 0 0 747 150360 ▇▁▁▁▁▁▁▁
integer vl_count_8 21260 0 21260 NA NA NA NaN NA NA NA NA NA NA
numeric avg_cd4_perc 21251 9 21260 NA NA NA 35.78 30.59 2 13 26 49 98 ▇▅▁▅▂▁▁▂
numeric avg_days_btwn_apptmts 0 21260 21260 NA NA NA 22.82 33.33 0 0 17.25 33 928 ▇▁▁▁▁▁▁▁
numeric avg_days_on_arvs 0 21260 21260 NA NA NA 86.04 322.92 -85 0 7.5 72.63 5480 ▇▁▁▁▁▁▁▁
numeric avg_days_on_treatment 0 21260 21260 NA NA NA 80.21 210.05 0 0 33.25 92.23 16328.5 ▇▁▁▁▁▁▁▁
numeric avg_dbp 682 20578 21260 NA NA NA 70.66 9.79 0 64.25 70 76 156 ▁▁▁▇▂▁▁▁
numeric avg_oxy_sat 3708 17552 21260 NA NA NA 96.5 4.35 0 96 97.33 98 100 ▁▁▁▁▁▁▁▇
numeric avg_pulse 659 20601 21260 NA NA NA 90.25 17.53 0 78.5 88 99.67 198 ▁▁▂▇▃▁▁▁
numeric avg_sbp 678 20582 21260 NA NA NA 114.02 14.66 0 105 112.5 120.59 243 ▁▁▁▇▂▁▁▁
numeric avg_temp 1833 19427 21260 NA NA NA 36.46 0.66 26 36.1 36.47 36.8 42 ▁▁▁▁▂▇▁▁
numeric avg_weight 0 21260 21260 NA NA NA 59.43 12.28 0 51.67 58 65.65 181 ▁▁▇▂▁▁▁▁
numeric bmi 0 21260 21260 NA NA NA 28.46 133.23 0 19.02 21.26 24.13 7098.34 ▇▁▁▁▁▁▁▁
numeric first_cd4_perc 21253 7 21260 NA NA NA 29.14 22.58 2 10.5 26 48 59 ▇▃▁▃▁▁▇▃
numeric first_dbp 1393 19867 21260 NA NA NA 70.67 11.77 0 60 70 79 161 ▁▁▃▇▂▁▁▁
numeric first_oxy_sat 4625 16635 21260 NA NA NA 96.25 5.26 0 96 97 98 100 ▁▁▁▁▁▁▁▇
numeric first_pulse 1254 20006 21260 NA NA NA 91.19 20.78 0 78 88 103 216 ▁▁▆▇▂▁▁▁
numeric first_sbp 1382 19878 21260 NA NA NA 113.74 17.22 0 100 111 121 243 ▁▁▁▇▂▁▁▁
numeric first_temp 2987 18273 21260 NA NA NA 36.55 0.82 25.8 36.1 36.6 37 40.7 ▁▁▁▁▁▇▃▁
numeric first_weight 71 21189 21260 NA NA NA 58.98 12.92 0 51 58 65 181 ▁▁▇▂▁▁▁▁
numeric height 0 21260 21260 NA NA NA 163.66 16.48 10 159 165 171 260 ▁▁▁▁▇▆▁▁
numeric last_cd4_perc 21251 9 21260 NA NA NA 35.78 30.59 2 13 26 49 98 ▇▅▁▅▂▁▁▂
numeric last_dbp 682 20578 21260 NA NA NA 70.84 11.54 0 61 70 79 156 ▁▁▁▇▃▁▁▁
numeric last_oxy_sat 3708 17552 21260 NA NA NA 96.55 5.49 0 96 98 98 100 ▁▁▁▁▁▁▁▇
numeric last_pulse 659 20601 21260 NA NA NA 89.21 19.69 0 76 87 100 214 ▁▁▆▇▂▁▁▁
numeric last_sbp 678 20582 21260 NA NA NA 114.5 16.78 0 103 113 122 243 ▁▁▁▇▃▁▁▁
numeric last_temp 1833 19427 21260 NA NA NA 36.41 0.78 26 36 36.4 36.8 42.5 ▁▁▁▁▆▇▁▁
numeric last_weight 0 21260 21260 NA NA NA 59.88 12.93 0 52 59 66 181 ▁▁▇▂▁▁▁▁
numeric max_cd4_perc 21251 9 21260 NA NA NA 35.78 30.59 2 13 26 49 98 ▇▅▁▅▂▁▁▂
numeric max_dbp 682 20578 21260 NA NA NA 75.57 11 0 70 76 81 156 ▁▁▁▇▇▁▁▁
numeric max_oxy_sat 3708 17552 21260 NA NA NA 97.4 3.99 0 97 98 99 100 ▁▁▁▁▁▁▁▇
numeric max_pulse 659 20601 21260 NA NA NA 91.92 17.25 0 82 92 98 204 ▁▁▂▇▁▁▁▁
numeric max_sbp 678 20582 21260 NA NA NA 117.11 20.33 0 100 120 130 243 ▁▁▂▇▅▁▁▁
numeric max_temp 1833 19427 21260 NA NA NA 36.81 0.78 26 36.4 36.8 37.2 42.5 ▁▁▁▁▂▇▁▁
numeric max_weight 0 21260 21260 NA NA NA 61.34 12.56 0 53 60 68 181 ▁▁▇▃▁▁▁▁
numeric min_cd4_perc 21251 9 21260 NA NA NA 35.78 30.59 2 13 26 49 98 ▇▅▁▅▂▁▁▂
numeric min_dbp 682 20578 21260 NA NA NA 66.74 13.1 0 60 64 71 161 ▁▁▇▇▁▁▁▁
numeric min_oxy_sat 3708 17552 21260 NA NA NA 95.37 7.65 0 95 97 98 100 ▁▁▁▁▁▁▁▇
numeric min_pulse 659 20601 21260 NA NA NA 90.32 21.25 0 74 92 104 214 ▁▁▆▇▃▁▁▁
numeric min_sbp 678 20582 21260 NA NA NA 109.49 14.99 0 100 108 117 243 ▁▁▁▇▁▁▁▁
numeric min_temp 1833 19427 21260 NA NA NA 36.07 0.89 25.8 35.7 36.1 36.5 42 ▁▁▁▁▃▇▁▁
numeric min_weight 0 21260 21260 NA NA NA 57.84 14.83 0 50 56 64 187 ▁▂▇▁▁▁▁▁
numeric prop_bad_adherence 0 21260 21260 NA NA NA 0.0091 0.064 0 0 0 0 1 ▇▁▁▁▁▁▁▁
numeric prop_days_on_arvs 0 21260 21260 NA NA NA -Inf NaN -Inf 0 0 0.048 1 ▇▁▁▁▁▁▁▁
numeric prop_days_on_tb_meds 0 21260 21260 NA NA NA 0.025 0.14 0 0 0 0 1 ▇▁▁▁▁▁▁▁
numeric prop_days_on_tb_prop 0 21260 21260 NA NA NA 0.053 0.17 0 0 0 0 0.99 ▇▁▁▁▁▁▁▁
numeric prop_defaulted_apptmts 0 21260 21260 NA NA NA 0.039 0.11 0 0 0 0 0.8 ▇▁▁▁▁▁▁▁
numeric vl_count_1_log 13536 7724 21260 NA NA NA -2.72 9.54 -11.51 -11.51 -11.51 6.37 15.89 ▇▁▁▁▂▃▂▁
numeric vl2_suppression 16645 4615 21260 NA NA NA 0.8 0.4 0 1 1 1 1 ▂▁▁▁▁▁▁▇
numeric vl3_suppression 18850 2410 21260 NA NA NA 0.79 0.4 0 1 1 1 1 ▂▁▁▁▁▁▁▇
numeric vl4_suppression 20288 972 21260 NA NA NA 0.79 0.41 0 1 1 1 1 ▂▁▁▁▁▁▁▇
numeric vl5_suppression 21016 244 21260 NA NA NA 0.69 0.46 0 0 1 1 1 ▃▁▁▁▁▁▁▇

Training and Validation Sets

The dataset was split into a training set and an out-of-sample validation set in a 0.9:0.1 ratio, i.e. 4,615 (90%) and 512 (10%) observations, respectively.
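The report does not show how the split was drawn. As a minimal sketch, assuming a simple seeded random shuffle (the function name, seed, and row identifiers below are hypothetical), a 90/10 hold-out could be produced as follows in Python:

```python
import random


def train_validation_split(row_ids, validation_fraction=0.1, seed=42):
    """Shuffle row_ids and hold out a fraction for out-of-sample validation."""
    ids = list(row_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    n_validation = int(len(ids) * validation_fraction)
    return ids[n_validation:], ids[:n_validation]


# 5,127 rows is the total implied by the 4,615 + 512 split reported above.
train_ids, validation_ids = train_validation_split(range(5127))
print(len(train_ids), len(validation_ids))
```

With an outcome as imbalanced as the one here, a stratified split that preserves the class proportions in both sets may be preferable to a plain shuffle.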

Logistic Regression Model

We used logistic regression, which estimates the probability of an outcome. Events are coded as a binary variable, with a value of 1 representing suppression and a value of 0 representing treatment failure.
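To make the link between the linear predictor and the estimated probability concrete, here is a minimal Python sketch of the logistic (sigmoid) transformation. The coefficients and the patient record are hypothetical values chosen for illustration; they are not the fitted estimates.

```python
import math


def sigmoid(z):
    """Map a log-odds value z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(intercept, coefficients, features):
    """P(outcome = 1) for one patient under a logistic regression model."""
    log_odds = intercept + sum(coefficients[name] * value
                               for name, value in features.items())
    return sigmoid(log_odds)


# Hypothetical coefficients and patient, for illustration only.
example_coefficients = {"is_male": -0.5, "first_age": 0.01}
example_patient = {"is_male": 1, "first_age": 40}
p = predict_probability(0.3, example_coefficients, example_patient)
print(round(p, 3))
```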

  Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
  ifelse(type == : prediction from a rank-deficient fit may be misleading

  
  Call:
  NULL
  
  Deviance Residuals: 
      Min       1Q   Median       3Q      Max  
  -2.5706  -0.9406   0.5530   0.8488   2.6402  
  
  Coefficients:
                             Estimate Std. Error z value Pr(>|z|)    
  (Intercept)               2.811e-01  1.935e-01   1.452 0.146381    
  first_age                 5.796e-03  2.996e-03   1.935 0.053018 .  
  bmi                       1.666e-04  2.077e-04   0.802 0.422385    
  prop_days_on_arvs        -5.293e-01  1.037e-01  -5.104 3.32e-07 ***
  prop_days_on_tb_meds      1.370e-01  2.023e-01   0.677 0.498447    
  prop_days_on_tb_prop      7.863e-01  1.883e-01   4.176 2.97e-05 ***
  prop_defaulted_apptmts   -6.869e-01  2.899e-01  -2.370 0.017809 *  
  prop_bad_adherence       -2.251e-01  5.701e-01  -0.395 0.692949    
  num_encounters            1.466e-02  9.382e-03   1.563 0.118139    
  vl_count_1_log            2.069e-03  4.853e-03   0.426 0.669849    
  first_who_stage           4.310e-02  3.415e-02   1.262 0.206876    
  first_arv_line           -2.390e-01  7.882e-02  -3.032 0.002432 ** 
  is_male                  -6.773e-01  6.559e-02 -10.326  < 2e-16 ***
  is_status_disclosed       2.096e-01  1.649e-01   1.271 0.203849    
  is_on_contraceptive       5.759e-01  8.660e-02   6.651 2.91e-11 ***
  is_on_health_cover       -4.904e-01  6.553e-02  -7.483 7.25e-14 ***
  is_on_cryptococcus_tx    -4.373e-01  3.223e-01  -1.357 0.174839    
  is_on_tb_prophy_regimen   8.352e-02  6.858e-02   1.218 0.223281    
  has_sti_symptoms         -4.931e-01  2.174e-01  -2.269 0.023293 *  
  has_tb_symptoms          -2.070e-01  1.047e-01  -1.978 0.047970 *  
  has_drug_tox_efcts        5.644e-01  3.410e-01   1.655 0.097901 .  
  has_toxic_drug            4.264e-03  5.486e-01   0.008 0.993798    
  has_referral_order        1.511e-01  7.034e-02   2.148 0.031693 *  
  has_phdp_referral         3.741e-01  1.071e-01   3.494 0.000477 ***
  needs_fam_tx_support      5.621e-01  8.993e-02   6.250 4.10e-10 ***
  has_changed_pcp          -1.567e+00  3.899e-01  -4.018 5.87e-05 ***
  has_changed_tb_tx         1.061e-01  3.130e-01   0.339 0.734711    
  has_restarted_tb_tx       1.058e+00  4.042e-01   2.617 0.008879 ** 
  has_been_hospitalized    -5.383e-01  1.786e-01  -3.015 0.002571 ** 
  has_sulf_peni_rxns       -2.163e-01  5.538e-01  -0.391 0.696065    
  is_general_pexam         -2.848e-01  1.034e-01  -2.754 0.005894 ** 
  is_skin_pexam             4.121e-01  1.325e-01   3.111 0.001865 ** 
  is_lymph_nodes_pexam      5.915e-01  3.188e-01   1.855 0.063530 .  
  is_respiratory_pexam      6.415e-01  2.289e-01   2.803 0.005069 ** 
  is_heent_pexam           -8.804e-01  2.895e-01  -3.042 0.002353 ** 
  is_cardiac_pexam          1.191e+01  1.701e+02   0.070 0.944160    
  is_abdominal_pexam        1.132e+00  5.236e-01   2.162 0.030603 *  
  is_urogenital_pexam       1.239e-01  7.945e-02   1.559 0.118941    
  is_extremies_pexam       -2.516e-01  4.389e-01  -0.573 0.566470    
  is_psychiatric_pexam      7.463e-01  8.262e-01   0.903 0.366382    
  is_neurologic_pexam      -4.674e-01  7.192e-01  -0.650 0.515757    
  is_musculoskeletal_pexam  4.173e-01  8.919e-01   0.468 0.639841    
  is_cxr_code_labs          2.151e-01  1.920e-01   1.120 0.262604    
  is_underweight            2.383e-01  9.167e-02   2.600 0.009330 ** 
  has_high_bp              -4.214e-02  6.544e-02  -0.644 0.519615    
  has_low_bp               -9.202e-01  2.989e-01  -3.079 0.002077 ** 
  has_abnormal_oxy_sat      9.641e-01  2.921e-01   3.301 0.000964 ***
  has_fever                -3.819e-01  2.366e-01  -1.614 0.106515    
  virologic_failure1       -2.019e+00  1.071e-01 -18.850  < 2e-16 ***
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  (Dispersion parameter for binomial family taken to be 1)
  
      Null deviance: 8623.8  on 6313  degrees of freedom
  Residual deviance: 7045.9  on 6265  degrees of freedom
  AIC: 7143.9
  
  Number of Fisher Scoring iterations: 11
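The estimates above are on the log-odds scale, so exponentiating them gives odds ratios for the outcome coded 1. The Python sketch below does this for a few coefficients copied from the table, adds approximate 95% Wald intervals, and checks that the reported AIC equals the residual deviance plus twice the number of fitted coefficients (49, including the intercept). The helper names are hypothetical.

```python
import math

# (estimate, standard error) pairs copied from the coefficient table above.
COEFFICIENTS = {
    "is_male": (-0.6773, 0.06559),
    "is_on_contraceptive": (0.5759, 0.08660),
    "is_on_health_cover": (-0.4904, 0.06553),
    "has_changed_pcp": (-1.567, 0.3899),
}


def odds_ratio(estimate):
    """Convert a log-odds coefficient to an odds ratio."""
    return math.exp(estimate)


def wald_interval(estimate, std_error, z=1.96):
    """Approximate 95% confidence interval on the odds-ratio scale."""
    return (math.exp(estimate - z * std_error),
            math.exp(estimate + z * std_error))


def aic(residual_deviance, n_coefficients):
    """AIC for an ungrouped binary logistic model: deviance + 2k."""
    return residual_deviance + 2 * n_coefficients


for name, (estimate, std_error) in COEFFICIENTS.items():
    low, high = wald_interval(estimate, std_error)
    print(f"{name}: OR={odds_ratio(estimate):.3f} ({low:.3f}, {high:.3f})")

print(round(aic(7045.9, 49), 1))
```

An odds ratio below 1 means the predictor is associated with lower odds of the outcome coded 1, holding the other covariates fixed.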

Model Discrimination Analysis

  
  Call:
  roc.formula(formula = relevel(as.factor(dataset$actual), "1") ~     dataset$prediction, plot = TRUE, print.auc = TRUE, thresholds = "best",     print.thres = "best", print.auc.y = 4, main = modelName,     percent = TRUE, ci = F, of = "thresholds")
  
  Data: dataset$prediction in 902 controls (relevel(as.factor(dataset$actual), "1") 1) > 3713 cases (relevel(as.factor(dataset$actual), "1") 0).
  Area under the curve: 72.04%
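The area under the ROC curve can be read as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. The Python sketch below computes it by direct pairwise comparison on toy data; the labels and scores are made up for illustration and are not the model's predictions.

```python
def auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as one half, which matches the trapezoidal ROC area.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy labels and predicted probabilities, made up for illustration.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
print(auc(toy_labels, toy_scores))
```

The pairwise loop is quadratic in the number of cases, which is acceptable for a few thousand observations; a rank-based formulation would scale better.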

Confusion Matrix

  Confusion Matrix and Statistics
  
            Reference
  Prediction    1    0
           1  568  981
           0  334 2732
                                            
                 Accuracy : 0.7151          
                   95% CI : (0.7018, 0.7281)
      No Information Rate : 0.8046          
      P-Value [Acc > NIR] : 1               
                                            
                    Kappa : 0.2875          
                                            
   Mcnemar's Test P-Value : <2e-16          
                                            
              Sensitivity : 0.6297          
              Specificity : 0.7358          
           Pos Pred Value : 0.3667          
           Neg Pred Value : 0.8911          
               Prevalence : 0.1954          
           Detection Rate : 0.1231          
     Detection Prevalence : 0.3356          
        Balanced Accuracy : 0.6828          
                                            
         'Positive' Class : 1               
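Every statistic in this summary follows from the four cell counts. As a minimal Python sketch (the function name `confusion_metrics` is hypothetical), the reported values can be recomputed as follows:

```python
def confusion_metrics(tp, fp, fn, tn):
    """Derive summary statistics from a 2x2 confusion matrix.

    tp/fp are the counts predicted as the positive class; fn/tn are the
    counts predicted as the negative class.
    """
    total = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / total
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "pos_pred_value": tp / (tp + fp),
        "neg_pred_value": tn / (tn + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "kappa": (accuracy - expected) / (1 - expected),
        "no_information_rate": max(tp + fn, fp + tn) / total,
    }


# Training-set matrix above: rows are predictions, columns are the reference,
# and the positive class is 1.
in_sample = confusion_metrics(tp=568, fp=981, fn=334, tn=2732)
for name, value in in_sample.items():
    print(f"{name}: {value:.4f}")
```

Because the accuracy (0.7151) is below the no-information rate (0.8046), sensitivity, specificity, and kappa are more informative here than raw accuracy.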
  

Out of Sample Validation

  Confusion Matrix and Statistics
  
            Reference
  Prediction   1   0
           1 114 144
           0  86 656
                                            
                 Accuracy : 0.77            
                   95% CI : (0.7426, 0.7958)
      No Information Rate : 0.8             
      P-Value [Acc > NIR] : 0.991248        
                                            
                    Kappa : 0.3517          
                                            
   Mcnemar's Test P-Value : 0.000171        
                                            
              Sensitivity : 0.5700          
              Specificity : 0.8200          
           Pos Pred Value : 0.4419          
           Neg Pred Value : 0.8841          
               Prevalence : 0.2000          
           Detection Rate : 0.1140          
     Detection Prevalence : 0.2580          
        Balanced Accuracy : 0.6950          
                                            
         'Positive' Class : 1               
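The `P-Value [Acc > NIR]` row tests whether accuracy exceeds the no-information rate. Assuming it is a one-sided exact binomial test of the number of correct predictions against that rate, the Python sketch below reproduces the calculation for the matrix above; the function name is hypothetical.

```python
from math import exp, lgamma, log


def binomial_upper_tail(successes, trials, p):
    """P(X >= successes) for X ~ Binomial(trials, p), summed in log space."""
    def log_pmf(k):
        return (lgamma(trials + 1) - lgamma(k + 1) - lgamma(trials - k + 1)
                + k * log(p) + (trials - k) * log(1 - p))
    return sum(exp(log_pmf(k)) for k in range(successes, trials + 1))


# Out-of-sample matrix above: 114 + 656 correct out of 1,000,
# with a no-information rate of 0.8 (the share of the larger class).
correct = 114 + 656
total = 114 + 144 + 86 + 656
p_value = binomial_upper_tail(correct, total, 0.8)
print(round(p_value, 6))
```

A p-value close to 1 indicates no evidence that the model's accuracy beats always predicting the majority class.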
  

Penalised Logistic Regression Model

Model summary

  49 x 1 sparse Matrix of class "dgCMatrix"
                                       1
  (Intercept)               0.3659385347
  first_age                 .           
  bmi                       .           
  prop_days_on_arvs        -0.0682323886
  prop_days_on_tb_meds      0.0433872586
  prop_days_on_tb_prop      0.5750124444
  prop_defaulted_apptmts    .           
  prop_bad_adherence        .           
  num_encounters            0.0043909959
  vl_count_1_log           -0.0081681485
  first_who_stage           0.0529748212
  first_arv_line            .           
  is_male                  -0.4227475399
  is_status_disclosed       .           
  is_on_contraceptive       0.3585342405
  is_on_health_cover       -0.2331863196
  is_on_cryptococcus_tx     .           
  is_on_tb_prophy_regimen   0.0773758945
  has_sti_symptoms          .           
  has_tb_symptoms          -0.0013162333
  has_drug_tox_efcts        .           
  has_toxic_drug            .           
  has_referral_order        0.1288846112
  has_phdp_referral         0.1098604608
  needs_fam_tx_support      0.2633168165
  has_changed_pcp          -0.0139558525
  has_changed_tb_tx         0.1286133577
  has_restarted_tb_tx       0.0846502319
  has_been_hospitalized    -0.6484629552
  has_sulf_peni_rxns       -0.1377638407
  is_general_pexam          .           
  is_skin_pexam             0.0277588432
  is_lymph_nodes_pexam      .           
  is_respiratory_pexam      0.0103840844
  is_heent_pexam           -0.2705097683
  is_cardiac_pexam          .           
  is_abdominal_pexam        0.3233126126
  is_urogenital_pexam       0.0008297731
  is_extremies_pexam        .           
  is_psychiatric_pexam      .           
  is_neurologic_pexam       .           
  is_musculoskeletal_pexam  0.0059279673
  is_cxr_code_labs          .           
  is_underweight            0.0100111218
  has_high_bp               .           
  has_low_bp               -0.0161505544
  has_abnormal_oxy_sat      .           
  has_fever                 .           
  virologic_failure1       -1.4657655240
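In the sparse matrix above, a `.` marks a coefficient that the penalty has shrunk exactly to zero (20 of the 48 predictors here). The report does not state which penalty was used. Assuming an L1 (lasso-type) component, the soft-thresholding operator used in coordinate-descent solvers produces exact zeros like these. The Python sketch below illustrates the operator with hypothetical input values, not with the fitted model.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; return exactly 0.0 inside the band."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


# Hypothetical unpenalised coefficients, for illustration only.
unpenalised = {"strong_effect": -0.68, "moderate_effect": 0.25, "weak_effect": 0.04}
penalty = 0.10

shrunk = {name: soft_threshold(beta, penalty) for name, beta in unpenalised.items()}
for name, beta in shrunk.items():
    print(name, round(beta, 2))
```

Small effects fall inside the threshold band and are dropped, while larger ones are kept but pulled toward zero.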

Model Discrimination Analysis

  
  Call:
  roc.formula(formula = relevel(as.factor(dataset$actual), "1") ~     dataset$prediction, plot = TRUE, print.auc = TRUE, thresholds = "best",     print.thres = "best", print.auc.y = 4, main = modelName,     percent = TRUE, ci = F, of = "thresholds")
  
  Data: dataset$prediction in 902 controls (relevel(as.factor(dataset$actual), "1") 1) > 3713 cases (relevel(as.factor(dataset$actual), "1") 0).
  Area under the curve: 71.55%

Confusion Matrix

  Confusion Matrix and Statistics
  
            Reference
  Prediction    1    0
           1  539  829
           0  363 2884
                                            
                 Accuracy : 0.7417          
                   95% CI : (0.7288, 0.7543)
      No Information Rate : 0.8046          
      P-Value [Acc > NIR] : 1               
                                            
                    Kappa : 0.3131          
                                            
   Mcnemar's Test P-Value : <2e-16          
                                            
              Sensitivity : 0.5976          
              Specificity : 0.7767          
           Pos Pred Value : 0.3940          
           Neg Pred Value : 0.8882          
               Prevalence : 0.1954          
           Detection Rate : 0.1168          
     Detection Prevalence : 0.2964          
        Balanced Accuracy : 0.6871          
                                            
         'Positive' Class : 1               
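
The summary statistics in this output follow directly from the four cell counts of the confusion matrix. As a check, the Python sketch below recomputes them from the in-sample counts shown above (539, 829, 363, 2884), with class 1 as the positive class. The function name is hypothetical, and the kappa formula is the standard Cohen's kappa, assumed to match what the R output reports.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Summary statistics for a 2x2 confusion matrix (class 1 is positive)."""
    total = tp + fp + fn + tn
    sens = tp / (tp + fn)          # true positives among actual positives
    spec = tn / (tn + fp)          # true negatives among actual negatives
    acc = (tp + tn) / total
    # Agreement expected by chance, from the row and column totals.
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total**2
    return {
        "accuracy": acc,
        "sensitivity": sens,
        "specificity": spec,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "balanced_accuracy": (sens + spec) / 2,
        "kappa": (acc - p_chance) / (1 - p_chance),
    }


# Cell counts copied from the in-sample confusion matrix above.
m = confusion_metrics(tp=539, fp=829, fn=363, tn=2884)
for name, value in m.items():
    print(f"{name:>18}: {value:.4f}")
```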
  

Out of Sample Validation

  Confusion Matrix and Statistics
  
            Reference
  Prediction   1   0
           1 100 105
           0 100 695
                                            
                 Accuracy : 0.795           
                   95% CI : (0.7686, 0.8196)
      No Information Rate : 0.8             
      P-Value [Acc > NIR] : 0.6705          
                                            
                    Kappa : 0.3653          
                                            
   Mcnemar's Test P-Value : 0.7800          
                                            
              Sensitivity : 0.5000          
              Specificity : 0.8688          
           Pos Pred Value : 0.4878          
           Neg Pred Value : 0.8742          
               Prevalence : 0.2000          
           Detection Rate : 0.1000          
     Detection Prevalence : 0.2050          
        Balanced Accuracy : 0.6844          
                                            
         'Positive' Class : 1               
  

KNN

In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression. k-NN is a type of instance-based (lazy) learning: the function is only approximated locally, and all computation is deferred until classification.
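
To make the "lazy" aspect concrete, here is a minimal Python sketch of k-NN classification by majority vote. It assumes Euclidean distance and uses made-up two-dimensional points; it is an illustration only, not the caret model fitted below, and all names are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Nothing is fitted in advance: all distance work happens at query time.
    by_distance = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    nearest_labels = [y for _, y in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data (hypothetical): two well-separated groups.
train_x = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_y = ["No", "No", "No", "Yes", "Yes", "Yes"]

pred_a = knn_predict(train_x, train_y, (0.15, 0.15), k=3)
pred_b = knn_predict(train_x, train_y, (0.95, 1.00), k=3)
print(pred_a, pred_b)
```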

     user  system elapsed 
    2.125   0.044  28.202
  k-Nearest Neighbors 
  
  4615 samples
    48 predictor
     2 classes: 'Yes', 'No' 
  
  No pre-processing
  Resampling: Cross-Validated (10 fold) 
  Summary of sample sizes: 4154, 4152, 4153, 4154, 4153, 4154, ... 
  Addtional sampling using SMOTE
  
  Resampling results across tuning parameters:
  
    k   ROC        Sens       Spec     
    44  0.7041066  0.4567277  0.8419174
    45  0.6975941  0.4434066  0.8497247
  
  Sens was used to select the optimal model using the largest value.
  The final value used for the model was k = 44.

Model Discrimination Analysis

  
  Call:
  roc.formula(formula = relevel(as.factor(dataset$actual), "1") ~     dataset$prediction, plot = TRUE, print.auc = TRUE, thresholds = "best",     print.thres = "best", print.auc.y = 4, main = modelName,     percent = TRUE, ci = F, of = "thresholds")
  
  Data: dataset$prediction in 902 controls (relevel(as.factor(dataset$actual), "1") 1) > 3713 cases (relevel(as.factor(dataset$actual), "1") 0).
  Area under the curve: 73.94%

Confusion Matrix

  Confusion Matrix and Statistics
  
            Reference
  Prediction    1    0
           1  563  987
           0  339 2726
                                            
                 Accuracy : 0.7127          
                   95% CI : (0.6994, 0.7257)
      No Information Rate : 0.8046          
      P-Value [Acc > NIR] : 1               
                                            
                    Kappa : 0.2817          
                                            
   Mcnemar's Test P-Value : <2e-16          
                                            
              Sensitivity : 0.6242          
              Specificity : 0.7342          
           Pos Pred Value : 0.3632          
           Neg Pred Value : 0.8894          
               Prevalence : 0.1954          
           Detection Rate : 0.1220          
     Detection Prevalence : 0.3359          
        Balanced Accuracy : 0.6792          
                                            
         'Positive' Class : 1               
  

Out of Sample Validation

  Confusion Matrix and Statistics
  
            Reference
  Prediction   1   0
           1 119 203
           0  81 597
                                            
                 Accuracy : 0.716           
                   95% CI : (0.6869, 0.7438)
      No Information Rate : 0.8             
      P-Value [Acc > NIR] : 1               
                                            
                    Kappa : 0.2777          
                                            
   Mcnemar's Test P-Value : 6.97e-13        
                                            
              Sensitivity : 0.5950          
              Specificity : 0.7462          
           Pos Pred Value : 0.3696          
           Neg Pred Value : 0.8805          
               Prevalence : 0.2000          
           Detection Rate : 0.1190          
     Detection Prevalence : 0.3220          
        Balanced Accuracy : 0.6706          
                                            
         'Positive' Class : 1               
  

Classification and Regression Trees (CART)

Classification and Regression Trees (CART) were first introduced in 1984 by a group led by Leo Breiman (Breiman et al. 1984). The CART algorithm sequentially conducts binary splits on the variables provided to it, resulting in a decision structure that resembles its namesake, a tree.
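
A single binary split can be sketched as follows. The Python example searches one numeric variable for the threshold that minimizes the weighted Gini impurity of the two child nodes, which is one common splitting criterion and is assumed here for illustration; a full tree repeats this search recursively within each child. The function names and toy values are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best split `value <= threshold`.

    Returns (None, parent_gini) when no split improves on the unsplit node.
    """
    best = (None, gini(labels))
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2.0
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


# Toy variable (hypothetical values) that separates the classes perfectly.
toy_values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
toy_labels = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(toy_values, toy_labels)
print(threshold, impurity)
```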

  CART 
  
  4615 samples
    48 predictor
     2 classes: 'Yes', 'No' 
  
  No pre-processing
  Resampling: Cross-Validated (10 fold) 
  Summary of sample sizes: 4154, 4154, 4153, 4153, 4153, 4154, ... 
  Addtional sampling using SMOTE
  
  Resampling results across tuning parameters:
  
    cp     ROC        Sens       Spec     
    0e+00  0.6489236  0.3912698  0.8661428
    1e-04  0.6502539  0.3890476  0.8720655
    2e-04  0.6595358  0.3801709  0.8841811
    3e-04  0.6670730  0.3723932  0.8973814
    4e-04  0.6738526  0.3657387  0.9057321
    5e-04  0.6805319  0.3557631  0.9178535
    6e-04  0.6824817  0.3546520  0.9210873
    7e-04  0.6870988  0.3591087  0.9254043
    8e-04  0.6870988  0.3591087  0.9254043
    9e-04  0.6884815  0.3535653  0.9289040
    1e-03  0.6874124  0.3535409  0.9289040
  
  Sens was used to select the optimal model using the largest value.
  The final value used for the model was cp = 0.

Tree Diagram

Model Discrimination Analysis

  
  Call:
  roc.formula(formula = relevel(as.factor(dataset$actual), "1") ~     dataset$prediction, plot = TRUE, print.auc = TRUE, thresholds = "best",     print.thres = "best", print.auc.y = 4, main = modelName,     percent = TRUE, ci = F, of = "thresholds")
  
  Data: dataset$prediction in 902 controls (relevel(as.factor(dataset$actual), "1") 1) > 3713 cases (relevel(as.factor(dataset$actual), "1") 0).
  Area under the curve: 83.36%

Confusion Matrix

  Confusion Matrix and Statistics
  
            Reference
  Prediction    1    0
           1  566  399
           0  336 3314
                                            
                 Accuracy : 0.8407          
                   95% CI : (0.8299, 0.8512)
      No Information Rate : 0.8046          
      P-Value [Acc > NIR] : 1.189e-10       
                                            
                    Kappa : 0.5066          
                                            
   Mcnemar's Test P-Value : 0.0222          
                                            
              Sensitivity : 0.6275          
              Specificity : 0.8925          
           Pos Pred Value : 0.5865          
           Neg Pred Value : 0.9079          
               Prevalence : 0.1954          
           Detection Rate : 0.1226          
     Detection Prevalence : 0.2091          
        Balanced Accuracy : 0.7600          
                                            
         'Positive' Class : 1               
  

Out of Sample Validation

  Confusion Matrix and Statistics
  
            Reference
  Prediction   1   0
           1 118  72
           0  82 728
                                            
                 Accuracy : 0.846           
                   95% CI : (0.8221, 0.8678)
      No Information Rate : 0.8             
      P-Value [Acc > NIR] : 0.0001063       
                                            
                    Kappa : 0.5096          
                                            
   Mcnemar's Test P-Value : 0.4683044       
                                            
              Sensitivity : 0.5900          
              Specificity : 0.9100          
           Pos Pred Value : 0.6211          
           Neg Pred Value : 0.8988          
               Prevalence : 0.2000          
           Detection Rate : 0.1180          
     Detection Prevalence : 0.1900          
        Balanced Accuracy : 0.7500          
                                            
         'Positive' Class : 1               
  

XGBOOST

https://xgboost.readthedocs.io/en/latest/parameter.html

Relative Importance

                      Feature        Gain       Cover   Frequency  Importance
   1:      virologic_failure1 0.293615243 0.012821169 0.005049682 0.293615243
   2:          vl_count_1_log 0.083067386 0.132833167 0.129337026 0.083067386
   3:                     bmi 0.081313208 0.244015642 0.229027529 0.081313208
   4:      has_referral_order 0.073463121 0.008024496 0.013194331 0.073463121
   5:                 is_male 0.068524036 0.011444079 0.016940870 0.068524036
   6:      is_on_health_cover 0.062913742 0.009250477 0.016126405 0.062913742
   7:               first_age 0.045449906 0.093886460 0.115328229 0.045449906
   8:       prop_days_on_arvs 0.038685462 0.113080302 0.110604333 0.038685462
   9:          first_arv_line 0.033575324 0.009918803 0.012542759 0.033575324
  10:          num_encounters 0.030569616 0.030345682 0.062225118 0.030569616
  11:             has_high_bp 0.022368534 0.008865405 0.012216973 0.022368534
  12:    prop_days_on_tb_prop 0.018424878 0.066293104 0.053754683 0.018424878
  13:     is_urogenital_pexam 0.016909827 0.006929173 0.011728295 0.016909827
  14:         first_who_stage 0.015576339 0.020660829 0.020361622 0.015576339
  15:     is_on_contraceptive 0.015135534 0.009007978 0.009773579 0.015135534
  16:  prop_defaulted_apptmts 0.015061570 0.030476917 0.034207526 0.015061570
  17: is_on_tb_prophy_regimen 0.014798505 0.006629841 0.017266656 0.014798505
  18:       has_phdp_referral 0.014453970 0.005445195 0.006352826 0.014453970
  19:         has_tb_symptoms 0.012362540 0.006140207 0.007493077 0.012362540
  20:    needs_fam_tx_support 0.009067414 0.006716438 0.010913830 0.009067414

Model Discrimination Analysis

  
  Call:
  roc.formula(formula = relevel(as.factor(dataset$actual), "1") ~     dataset$prediction, plot = TRUE, print.auc = TRUE, thresholds = "best",     print.thres = "best", print.auc.y = 4, main = modelName,     percent = TRUE, ci = F, of = "thresholds")
  
  Data: dataset$prediction in 902 controls (relevel(as.factor(dataset$actual), "1") 1) > 3713 cases (relevel(as.factor(dataset$actual), "1") 0).
  Area under the curve: 97.35%

Confusion Matrix

  Confusion Matrix and Statistics
  
            Reference
  Prediction    1    0
           1  818  215
           0   84 3498
                                            
                 Accuracy : 0.9352          
                   95% CI : (0.9277, 0.9421)
      No Information Rate : 0.8046          
      P-Value [Acc > NIR] : < 2.2e-16       
                                            
                    Kappa : 0.8047          
                                            
   Mcnemar's Test P-Value : 5.558e-14       
                                            
              Sensitivity : 0.9069          
              Specificity : 0.9421          
           Pos Pred Value : 0.7919          
           Neg Pred Value : 0.9765          
               Prevalence : 0.1954          
           Detection Rate : 0.1772          
     Detection Prevalence : 0.2238          
        Balanced Accuracy : 0.9245          
                                            
         'Positive' Class : 1               
  

Out of Sample Validation

  Confusion Matrix and Statistics
  
            Reference
  Prediction   1   0
           1 175  37
           0  25 763
                                            
                 Accuracy : 0.938           
                   95% CI : (0.9212, 0.9521)
      No Information Rate : 0.8             
      P-Value [Acc > NIR] : <2e-16          
                                            
                    Kappa : 0.8105          
                                            
   Mcnemar's Test P-Value : 0.1624          
                                            
              Sensitivity : 0.8750          
              Specificity : 0.9537          
           Pos Pred Value : 0.8255          
           Neg Pred Value : 0.9683          
               Prevalence : 0.2000          
           Detection Rate : 0.1750          
     Detection Prevalence : 0.2120          
        Balanced Accuracy : 0.9144          
                                            
         'Positive' Class : 1               
  

GBM

https://xgboost.readthedocs.io/en/latest/parameter.html

     user  system elapsed 
    5.738   0.072 121.354
  Stochastic Gradient Boosting 
  
  4615 samples
    48 predictor
     2 classes: 'Yes', 'No' 
  
  No pre-processing
  Resampling: Cross-Validated (10 fold) 
  Summary of sample sizes: 4154, 4154, 4153, 4154, 4153, 4153, ... 
  Addtional sampling using SMOTE
  
  Resampling results across tuning parameters:
  
    interaction.depth  n.trees  ROC        Sens       Spec     
     9                 100      0.7197977  0.3736386  0.9383235
     9                 300      0.7054051  0.3802930  0.9210851
     9                 400      0.7045447  0.3825397  0.9181252
    10                 100      0.7218032  0.3768864  0.9380467
    10                 300      0.7144480  0.3990965  0.9248536
    10                 400      0.7080026  0.3968864  0.9127416
  
  Tuning parameter 'shrinkage' was held constant at a value of 0.1
  
  Tuning parameter 'n.minobsinnode' was held constant at a value of 20
  ROC was used to select the optimal model using the largest value.
  The final values used for the model were n.trees = 100,
   interaction.depth = 10, shrinkage = 0.1 and n.minobsinnode = 20.

Relative Importance

Model Discrimination Analysis

  
  Call:
  roc.formula(formula = relevel(as.factor(dataset$actual), "1") ~     dataset$prediction, plot = TRUE, print.auc = TRUE, thresholds = "best",     print.thres = "best", print.auc.y = 4, main = modelName,     percent = TRUE, ci = F, of = "thresholds")
  
  Data: dataset$prediction in 902 controls (relevel(as.factor(dataset$actual), "1") 1) > 3713 cases (relevel(as.factor(dataset$actual), "1") 0).
  Area under the curve: 85.37%

Confusion Matrix

  Confusion Matrix and Statistics
  
            Reference
  Prediction    1    0
           1  472  241
           0  430 3472
                                            
                 Accuracy : 0.8546          
                   95% CI : (0.8441, 0.8647)
      No Information Rate : 0.8046          
      P-Value [Acc > NIR] : < 2.2e-16       
                                            
                    Kappa : 0.4979          
                                            
   Mcnemar's Test P-Value : 3.938e-13       
                                            
              Sensitivity : 0.5233          
              Specificity : 0.9351          
           Pos Pred Value : 0.6620          
           Neg Pred Value : 0.8898          
               Prevalence : 0.1954          
           Detection Rate : 0.1023          
     Detection Prevalence : 0.1545          
        Balanced Accuracy : 0.7292          
                                            
         'Positive' Class : 1               
  

Out of Sample Validation

  Confusion Matrix and Statistics
  
            Reference
  Prediction   1   0
           1  89  29
           0 111 771
                                            
                 Accuracy : 0.86            
                   95% CI : (0.8369, 0.8809)
      No Information Rate : 0.8             
      P-Value [Acc > NIR] : 4.791e-07       
                                            
                    Kappa : 0.483           
                                            
   Mcnemar's Test P-Value : 7.608e-12       
                                            
              Sensitivity : 0.4450          
              Specificity : 0.9637          
           Pos Pred Value : 0.7542          
           Neg Pred Value : 0.8741          
               Prevalence : 0.2000          
           Detection Rate : 0.0890          
     Detection Prevalence : 0.1180          
        Balanced Accuracy : 0.7044          
                                            
         'Positive' Class : 1               
  

Random Forest

Random forests improve predictive accuracy by generating a large number of bootstrapped trees (based on random samples of variables), classifying a case using each tree in this new “forest”, and deciding a final predicted outcome by combining the results across all of the trees (an average in regression, a majority vote in classification). Breiman and Cutler’s random forest approach is implemented via the randomForest package.
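
The bootstrap-and-vote idea can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the randomForest algorithm: each "tree" is a single-split stump, one variable is drawn at random per tree rather than per split, and bootstrap samples containing only one class are redrawn. All names and the toy data are hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def fit_stump(rows, labels, feature):
    """One-split tree on `feature`: (feature, threshold, left_label, right_label)."""
    values = sorted({row[feature] for row in rows})
    best = None
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0
        left = [y for row, y in zip(rows, labels) if row[feature] <= thr]
        right = [y for row, y in zip(rows, labels) if row[feature] > thr]
        score = len(left) * gini(left) + len(right) * gini(right)
        if best is None or score < best[0]:
            best = (score, thr, majority(left), majority(right))
    if best is None:  # only one distinct value: constant prediction
        label = majority(labels)
        return feature, values[0], label, label
    return feature, best[1], best[2], best[3]


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Fit stumps on bootstrap samples, each using one randomly chosen variable."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    while len(forest) < n_trees:
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        sample_y = [labels[i] for i in idx]
        if len(set(sample_y)) < 2:
            continue  # simplification: redraw single-class samples
        feature = rng.randrange(n_features)
        forest.append(fit_stump([rows[i] for i in idx], sample_y, feature))
    return forest


def predict_forest(forest, row):
    """Majority vote across all trees in the forest."""
    return majority([l if row[f] <= thr else r for f, thr, l, r in forest])


# Toy data (hypothetical): both variables separate the two classes.
rows = [(0.0, 0.2), (0.3, 0.1), (0.5, 0.9), (0.8, 0.4), (1.0, 0.6),
        (5.0, 5.5), (5.2, 5.1), (5.6, 6.0), (5.8, 5.3), (6.0, 5.9)]
labels = [0] * 5 + [1] * 5
forest = fit_forest(rows, labels)
print(predict_forest(forest, (0.4, 0.5)), predict_forest(forest, (5.5, 5.5)))
```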

     user  system elapsed 
   15.809   0.168 141.097

Model Discrimination Analysis

  
  Call:
  roc.formula(formula = relevel(as.factor(dataset$actual), "1") ~     dataset$prediction, plot = TRUE, print.auc = TRUE, thresholds = "best",     print.thres = "best", print.auc.y = 4, main = modelName,     percent = TRUE, ci = F, of = "thresholds")
  
  Data: dataset$prediction in 902 controls (relevel(as.factor(dataset$actual), "1") 1) > 3713 cases (relevel(as.factor(dataset$actual), "1") 0).
  Area under the curve: 98.47%

Confusion Matrix

  Confusion Matrix and Statistics
  
            Reference
  Prediction    1    0
           1  795  143
           0  107 3570
                                            
                 Accuracy : 0.9458          
                   95% CI : (0.9389, 0.9522)
      No Information Rate : 0.8046          
      P-Value [Acc > NIR] : < 2e-16         
                                            
                    Kappa : 0.8303          
                                            
   Mcnemar's Test P-Value : 0.02686         
                                            
              Sensitivity : 0.8814          
              Specificity : 0.9615          
           Pos Pred Value : 0.8475          
           Neg Pred Value : 0.9709          
               Prevalence : 0.1954          
           Detection Rate : 0.1723          
     Detection Prevalence : 0.2033          
        Balanced Accuracy : 0.9214          
                                            
         'Positive' Class : 1               
  

Relative Importance of Variables

According to https://dinsdalelab.sdsu.edu/metag.stats/code/randomforest.html, “the mean decrease in Gini coefficient is a measure of how each variable contributes to the homogeneity of the nodes and leaves in the resulting random forest.”

Variable MeanDecreaseGini
first_age 133.19
bmi 141.84
prop_days_on_arvs 110.83
prop_days_on_tb_meds 38.00
prop_days_on_tb_prop 62.79
prop_defaulted_apptmts 76.49
prop_bad_adherence 15.16
num_encounters 112.13
vl_count_1_log 322.87
first_who_stage 71.64
first_arv_line 107.67
is_male 170.84
is_status_disclosed 12.20
is_on_contraceptive 108.82
is_on_health_cover 109.04
is_on_cryptococcus_tx 4.30
is_on_tb_prophy_regimen 129.16
has_sti_symptoms 9.48
has_tb_symptoms 39.53
has_drug_tox_efcts 3.30
has_toxic_drug 1.35
has_referral_order 110.91
has_phdp_referral 52.22
needs_fam_tx_support 47.79
has_changed_pcp 5.99
has_changed_tb_tx 3.52
has_restarted_tb_tx 2.48
has_been_hospitalized 36.63
has_sulf_peni_rxns 1.10
is_general_pexam 32.96
is_skin_pexam 14.97
is_lymph_nodes_pexam 3.28
is_respiratory_pexam 6.27
is_heent_pexam 5.10
is_cardiac_pexam 0.06
is_abdominal_pexam 1.06
is_urogenital_pexam 55.49
is_extremies_pexam 2.70
is_psychiatric_pexam 0.13
is_neurologic_pexam 0.61
is_musculoskeletal_pexam 0.23
is_cxr_code_labs 8.53
is_underweight 32.72
has_high_bp 110.37
has_low_bp 5.18
has_abnormal_oxy_sat 4.64
has_fever 7.68
virologic_failure1 374.03

Error Rate

This plot shows the class error rates of the random forest model. As the number of trees increases, the error rate approaches zero.

Out of Sample Validation

  Confusion Matrix and Statistics
  
            Reference
  Prediction   1   0
           1 168  21
           0  32 779
                                            
                 Accuracy : 0.947           
                   95% CI : (0.9312, 0.9601)
      No Information Rate : 0.8             
      P-Value [Acc > NIR] : <2e-16          
                                            
                    Kappa : 0.8309          
                                            
   Mcnemar's Test P-Value : 0.1696          
                                            
              Sensitivity : 0.8400          
              Specificity : 0.9738          
           Pos Pred Value : 0.8889          
           Neg Pred Value : 0.9605          
               Prevalence : 0.2000          
           Detection Rate : 0.1680          
     Detection Prevalence : 0.1890          
        Balanced Accuracy : 0.9069          
                                            
         'Positive' Class : 1               
  

BART (Bayesian Additive Regression Trees)

SVM

     user  system elapsed 
   17.129   3.494 215.886
  Support Vector Machines with Radial Basis Function Kernel 
  
  4615 samples
    48 predictor
     2 classes: 'Yes', 'No' 
  
  No pre-processing
  Resampling: Cross-Validated (10 fold) 
  Summary of sample sizes: 4153, 4154, 4154, 4154, 4154, 4153, ... 
  Addtional sampling using SMOTE
  
  Resampling results across tuning parameters:
  
    C     ROC        Sens       Spec     
    0.25  0.6810117  0.4190965  0.8685701
    0.50  0.6943176  0.4268254  0.8742305
    1.00  0.6998193  0.4102076  0.8736921
  
  Tuning parameter 'sigma' was held constant at a value of 0.02132118
  ROC was used to select the optimal model using the largest value.
  The final values used for the model were sigma = 0.02132118 and C = 1.

Model Discrimination Analysis

  
  Call:
  roc.formula(formula = relevel(as.factor(dataset$actual), "1") ~     dataset$prediction, plot = TRUE, print.auc = TRUE, thresholds = "best",     print.thres = "best", print.auc.y = 4, main = modelName,     percent = TRUE, ci = F, of = "thresholds")
  
  Data: dataset$prediction in 902 controls (relevel(as.factor(dataset$actual), "1") 1) > 3713 cases (relevel(as.factor(dataset$actual), "1") 0).
  Area under the curve: 83.19%

Confusion Matrix

  Confusion Matrix and Statistics
  
            Reference
  Prediction    1    0
           1  636  609
           0  266 3104
                                            
                 Accuracy : 0.8104          
                   95% CI : (0.7988, 0.8216)
      No Information Rate : 0.8046          
      P-Value [Acc > NIR] : 0.1627          
                                            
                    Kappa : 0.473           
                                            
   Mcnemar's Test P-Value : <2e-16          
                                            
              Sensitivity : 0.7051          
              Specificity : 0.8360          
           Pos Pred Value : 0.5108          
           Neg Pred Value : 0.9211          
               Prevalence : 0.1954          
           Detection Rate : 0.1378          
     Detection Prevalence : 0.2698          
        Balanced Accuracy : 0.7705          
                                            
         'Positive' Class : 1               
  

Out of Sample Validation

  Confusion Matrix and Statistics
  
            Reference
  Prediction   1   0
           1 124 102
           0  76 698
                                            
                 Accuracy : 0.822           
                   95% CI : (0.7969, 0.8452)
      No Information Rate : 0.8             
      P-Value [Acc > NIR] : 0.04311         
                                            
                    Kappa : 0.4696          
                                            
   Mcnemar's Test P-Value : 0.06095         
                                            
              Sensitivity : 0.6200          
              Specificity : 0.8725          
           Pos Pred Value : 0.5487          
           Neg Pred Value : 0.9018          
               Prevalence : 0.2000          
           Detection Rate : 0.1240          
     Detection Prevalence : 0.2260          
        Balanced Accuracy : 0.7463          
                                            
         'Positive' Class : 1               
  

Comparative Analysis

Sensitivity And Specificity Analysis

  
  Call:
  summary.resamples(object = resamps)
  
  Models: XGBoost, GBM, BART, RandomForest, SVMRadial, GLMNET, OLSLogistic, CART, KNN 
  Number of resamples: 10 
  
  ROC 
                    Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
  XGBoost      0.8596014 0.9088609 0.9322113 0.9289775 0.9481066 1.0000000    0
  GBM          0.9188805 0.9237962 0.9349068 0.9397287 0.9468209 0.9808322    0
  BART         0.8943432 0.9140283 0.9230949 0.9322802 0.9506779 0.9914611    0
  RandomForest 0.8747605 0.8934152 0.9021964 0.9048755 0.9226003 0.9293582    0
  SVMRadial    0.7824000 0.8392000 0.8824000 0.8762667 0.9132000 0.9434667    0
  GLMNET       0.7482482 0.8369514 0.8561110 0.8510192 0.8746921 0.9079079    0
  OLSLogistic  0.7592202 0.8080992 0.8386175 0.8379713 0.8610967 0.9178082    0
  CART         0.7633300 0.8455481 0.8918181 0.8723098 0.9055649 0.9350198    0
  KNN          0.7171487 0.7736976 0.8128080 0.8249873 0.8974142 0.9308732    0
  
  Sens 
                    Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
  XGBoost      0.7741935 0.7875504 0.8225806 0.8399194 0.8641633 1.0000000    0
  GBM          0.7812500 0.8387097 0.8573589 0.8560484 0.8709677 0.9354839    0
  BART         0.7741935 0.8190524 0.8387097 0.8365927 0.8641633 0.8709677    0
  RandomForest 0.6206897 0.7500000 0.7857143 0.7604680 0.8143473 0.8275862    0
  SVMRadial    0.6000000 0.6900000 0.7400000 0.7360000 0.7600000 0.8800000    0
  GLMNET       0.5555556 0.6538462 0.6794872 0.6851852 0.7307692 0.7692308    0
  OLSLogistic  0.5769231 0.6356838 0.6730769 0.6962963 0.7382479 0.9230769    0
  CART         0.6071429 0.6958128 0.7721675 0.7498768 0.7912562 0.9285714    0
  KNN          0.4814815 0.5808405 0.7307692 0.7052707 0.8269231 0.8846154    0
  
  Spec 
                    Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
  XGBoost      0.8985507 0.9420290 0.9489557 0.9564152 0.9817775 1.0000000    0
  GBM          0.9130435 0.9600384 0.9708014 0.9622336 0.9710145 0.9855072    0
  BART         0.8405797 0.9130435 0.9492754 0.9376385 0.9670716 1.0000000    0
  RandomForest 0.9436620 0.9444444 0.9444444 0.9567488 0.9683099 0.9861111    0
  SVMRadial    0.8666667 0.9333333 0.9333333 0.9360000 0.9566667 0.9733333    0
  GLMNET       0.8783784 0.9220659 0.9388190 0.9415957 0.9695946 0.9864865    0
  OLSLogistic  0.8378378 0.8949463 0.9183636 0.9117734 0.9324324 0.9459459    0
  CART         0.8732394 0.8990610 0.9305556 0.9272692 0.9546655 0.9722222    0
  KNN          0.7671233 0.8141892 0.8503332 0.8410959 0.8614865 0.9324324    0

ROCs

Discriminatory

Model Interpretation

Cutoff Optimization

\(E(Y_2|Y_1,V_0)\): Model performance comparison for the second virologic failure at different probability thresholds
Model Cutoff Sens Spec Accuracy PPV NPV F1 Bal Acc Kappa
XGBoost 0.10 0.88 0.90 0.89 0.81 0.94 0.84 0.89 0.76
XGBoost 0.15 0.87 0.92 0.90 0.83 0.94 0.85 0.89 0.78
XGBoost 0.20 0.87 0.92 0.91 0.84 0.94 0.85 0.90 0.79
XGBoost 0.25 0.86 0.94 0.91 0.86 0.94 0.86 0.90 0.80
XGBoost 0.30 0.85 0.94 0.91 0.87 0.93 0.86 0.90 0.80
XGBoost 0.35 0.85 0.95 0.92 0.89 0.93 0.87 0.90 0.81
XGBoost 0.40 0.85 0.95 0.92 0.89 0.93 0.87 0.90 0.81
XGBoost 0.45 0.85 0.95 0.92 0.90 0.93 0.87 0.90 0.81
XGBoost 0.50 0.84 0.96 0.92 0.90 0.93 0.87 0.90 0.81
BART 0.10 0.90 0.87 0.88 0.77 0.95 0.83 0.89 0.74
BART 0.15 0.89 0.91 0.90 0.82 0.95 0.85 0.90 0.78
BART 0.20 0.89 0.93 0.92 0.86 0.95 0.87 0.91 0.81
BART 0.25 0.88 0.94 0.92 0.88 0.95 0.88 0.91 0.82
BART 0.30 0.88 0.94 0.92 0.88 0.95 0.88 0.91 0.82
BART 0.35 0.87 0.95 0.92 0.89 0.94 0.88 0.91 0.82
BART 0.40 0.87 0.95 0.93 0.90 0.94 0.88 0.91 0.83
BART 0.45 0.86 0.96 0.93 0.91 0.94 0.88 0.91 0.83
BART 0.50 0.86 0.96 0.93 0.91 0.94 0.88 0.91 0.83
GBM 0.10 0.92 0.75 0.81 0.63 0.96 0.75 0.84 0.60
GBM 0.15 0.91 0.80 0.83 0.67 0.95 0.77 0.85 0.65
GBM 0.20 0.91 0.83 0.85 0.71 0.95 0.80 0.87 0.69
GBM 0.25 0.89 0.86 0.87 0.76 0.95 0.82 0.88 0.72
GBM 0.30 0.88 0.89 0.88 0.78 0.94 0.83 0.88 0.74
GBM 0.35 0.87 0.91 0.90 0.82 0.94 0.84 0.89 0.76
GBM 0.40 0.86 0.93 0.91 0.85 0.93 0.85 0.89 0.78
GBM 0.45 0.85 0.93 0.91 0.86 0.93 0.85 0.89 0.78
GBM 0.50 0.84 0.94 0.91 0.87 0.93 0.85 0.89 0.78
RandomForest 0.10 0.92 0.53 0.64 0.44 0.94 0.59 0.73 0.34
RandomForest 0.15 0.89 0.75 0.79 0.59 0.95 0.71 0.82 0.55
RandomForest 0.20 0.84 0.85 0.85 0.70 0.93 0.76 0.84 0.65
RandomForest 0.25 0.82 0.90 0.88 0.77 0.93 0.79 0.86 0.71
RandomForest 0.30 0.79 0.93 0.89 0.82 0.92 0.80 0.86 0.73
RandomForest 0.35 0.79 0.94 0.90 0.85 0.92 0.82 0.87 0.75
RandomForest 0.40 0.77 0.95 0.90 0.87 0.91 0.82 0.86 0.75
RandomForest 0.45 0.77 0.96 0.90 0.88 0.91 0.82 0.86 0.75
RandomForest 0.50 0.76 0.96 0.90 0.88 0.91 0.81 0.86 0.75
SVMRadial 0.10 0.83 0.73 0.76 0.51 0.93 0.63 0.78 0.47
SVMRadial 0.15 0.82 0.81 0.81 0.60 0.93 0.69 0.82 0.56
SVMRadial 0.20 0.81 0.84 0.83 0.63 0.93 0.71 0.83 0.59
SVMRadial 0.25 0.81 0.86 0.85 0.66 0.93 0.73 0.83 0.62
SVMRadial 0.30 0.80 0.88 0.86 0.70 0.93 0.74 0.84 0.65
SVMRadial 0.35 0.79 0.90 0.87 0.72 0.93 0.75 0.84 0.66
SVMRadial 0.40 0.78 0.91 0.88 0.75 0.92 0.76 0.85 0.68
SVMRadial 0.45 0.76 0.93 0.89 0.79 0.92 0.77 0.84 0.70
SVMRadial 0.50 0.74 0.94 0.89 0.80 0.91 0.76 0.84 0.69
GLMNET 0.10 0.95 0.22 0.41 0.30 0.93 0.46 0.59 0.10
GLMNET 0.15 0.87 0.48 0.59 0.38 0.91 0.53 0.68 0.25
GLMNET 0.20 0.81 0.66 0.70 0.46 0.91 0.59 0.73 0.38
GLMNET 0.25 0.78 0.74 0.75 0.52 0.90 0.63 0.76 0.45
GLMNET 0.30 0.77 0.82 0.81 0.61 0.91 0.68 0.80 0.55
GLMNET 0.35 0.75 0.88 0.85 0.70 0.91 0.72 0.82 0.62
GLMNET 0.40 0.72 0.91 0.86 0.75 0.90 0.74 0.82 0.64
GLMNET 0.45 0.71 0.93 0.87 0.80 0.90 0.75 0.82 0.66
GLMNET 0.50 0.69 0.94 0.87 0.82 0.89 0.74 0.81 0.66
OLSLogistic 0.10 0.84 0.55 0.63 0.40 0.91 0.54 0.70 0.29
OLSLogistic 0.15 0.80 0.65 0.69 0.45 0.90 0.58 0.72 0.36
OLSLogistic 0.20 0.78 0.73 0.74 0.51 0.90 0.62 0.75 0.44
OLSLogistic 0.25 0.76 0.79 0.78 0.57 0.90 0.65 0.77 0.49
OLSLogistic 0.30 0.74 0.83 0.81 0.62 0.90 0.67 0.79 0.54
OLSLogistic 0.35 0.73 0.86 0.83 0.66 0.90 0.68 0.79 0.56
OLSLogistic 0.40 0.71 0.88 0.84 0.69 0.90 0.69 0.80 0.58
OLSLogistic 0.45 0.70 0.90 0.85 0.71 0.90 0.70 0.80 0.60
OLSLogistic 0.50 0.70 0.91 0.86 0.74 0.89 0.71 0.80 0.62
CART 0.10 0.80 0.89 0.86 0.75 0.92 0.77 0.84 0.67
CART 0.15 0.79 0.92 0.88 0.80 0.92 0.79 0.85 0.71
CART 0.20 0.79 0.92 0.88 0.80 0.92 0.79 0.85 0.71
CART 0.25 0.79 0.92 0.88 0.81 0.92 0.79 0.85 0.71
CART 0.30 0.79 0.92 0.88 0.81 0.92 0.79 0.85 0.71
CART 0.35 0.78 0.92 0.88 0.81 0.91 0.79 0.85 0.71
CART 0.40 0.76 0.92 0.88 0.80 0.91 0.78 0.84 0.70
CART 0.45 0.75 0.93 0.88 0.81 0.91 0.78 0.84 0.69
CART 0.50 0.75 0.93 0.88 0.82 0.91 0.78 0.84 0.70
KNN 0.10 0.86 0.40 0.53 0.34 0.90 0.49 0.63 0.18
KNN 0.15 0.83 0.58 0.64 0.41 0.90 0.55 0.70 0.31
KNN 0.20 0.81 0.66 0.70 0.46 0.91 0.58 0.73 0.37
KNN 0.25 0.80 0.69 0.72 0.48 0.91 0.60 0.74 0.40
KNN 0.30 0.77 0.72 0.73 0.50 0.90 0.60 0.75 0.42
KNN 0.35 0.77 0.74 0.74 0.51 0.90 0.61 0.75 0.44
KNN 0.40 0.75 0.77 0.76 0.54 0.90 0.63 0.76 0.46
KNN 0.45 0.73 0.80 0.78 0.56 0.89 0.63 0.76 0.48
KNN 0.50 0.70 0.85 0.81 0.62 0.89 0.65 0.77 0.52
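
The rows above report each model's metrics at probability cutoffs from 0.10 to 0.50 in steps of 0.05. As a minimal sketch of how such a sweep can be computed for sensitivity and specificity (the two metrics named in the Methods), the following assumes binary labels with 1 meaning virologic failure and a vector of predicted probabilities. The function names and the toy data are hypothetical, invented for illustration, and are not taken from the analysis itself.

```python
def confusion_counts(y_true, y_prob, threshold):
    """Count TP, FP, FN, TN when predicting 1 for probabilities >= threshold."""
    tp = fp = fn = tn = 0
    for label, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def threshold_sweep(y_true, y_prob, thresholds):
    """Return (threshold, sensitivity, specificity) for each cutoff."""
    rows = []
    for t in thresholds:
        tp, fp, fn, tn = confusion_counts(y_true, y_prob, t)
        # Guard against division by zero when a class is absent.
        sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
        specificity = tn / (tn + fp) if (tn + fp) else float("nan")
        rows.append((t, sensitivity, specificity))
    return rows


# Toy data (hypothetical): 3 failures, 5 non-failures.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.90, 0.45, 0.12, 0.40, 0.20, 0.15, 0.08, 0.05]

# Same cutoff grid as the table: 0.10, 0.15, ..., 0.50.
thresholds = [round(0.10 + 0.05 * i, 2) for i in range(9)]

for t, sens, spec in threshold_sweep(y_true, y_prob, thresholds):
    print(f"{t:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

Raising the cutoff labels fewer patients as high risk, so sensitivity can only fall or stay flat while specificity can only rise or stay flat, which is the pattern visible across the rows for each model.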

Stacked Ensembling

Model Discrimination Analysis

Confusion Matrix

Relative Importance

                      Feature        Gain       Cover   Frequency  Importance
   1:      virologic_failure1 0.293615243 0.012821169 0.005049682 0.293615243
   2:          vl_count_1_log 0.083067386 0.132833167 0.129337026 0.083067386
   3:                     bmi 0.081313208 0.244015642 0.229027529 0.081313208
   4:      has_referral_order 0.073463121 0.008024496 0.013194331 0.073463121
   5:                 is_male 0.068524036 0.011444079 0.016940870 0.068524036
   6:      is_on_health_cover 0.062913742 0.009250477 0.016126405 0.062913742
   7:               first_age 0.045449906 0.093886460 0.115328229 0.045449906
   8:       prop_days_on_arvs 0.038685462 0.113080302 0.110604333 0.038685462
   9:          first_arv_line 0.033575324 0.009918803 0.012542759 0.033575324
  10:          num_encounters 0.030569616 0.030345682 0.062225118 0.030569616
  11:             has_high_bp 0.022368534 0.008865405 0.012216973 0.022368534
  12:    prop_days_on_tb_prop 0.018424878 0.066293104 0.053754683 0.018424878
  13:     is_urogenital_pexam 0.016909827 0.006929173 0.011728295 0.016909827
  14:         first_who_stage 0.015576339 0.020660829 0.020361622 0.015576339
  15:     is_on_contraceptive 0.015135534 0.009007978 0.009773579 0.015135534
  16:  prop_defaulted_apptmts 0.015061570 0.030476917 0.034207526 0.015061570
  17: is_on_tb_prophy_regimen 0.014798505 0.006629841 0.017266656 0.014798505
  18:       has_phdp_referral 0.014453970 0.005445195 0.006352826 0.014453970
  19:         has_tb_symptoms 0.012362540 0.006140207 0.007493077 0.012362540
  20:    needs_fam_tx_support 0.009067414 0.006716438 0.010913830 0.009067414

Appendix

Kappa: similar to accuracy, but adjusted for the agreement that would be expected by chance alone. One possible interpretation of Kappa:

* Poor agreement: less than 0.20
* Fair agreement: 0.20 to 0.40
* Moderate agreement: 0.40 to 0.60
* Good agreement: 0.60 to 0.80
* Very good agreement: 0.80 to 1.00
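
To make the chance correction concrete, here is a minimal sketch of Cohen's kappa for a 2x2 confusion matrix, computed as (observed agreement - expected agreement) / (1 - expected agreement), together with the interpretation bands listed above. The function names are hypothetical, the example counts are invented for illustration, and assigning boundary values (such as exactly 0.40) to the higher band is an assumption, since the bands as stated overlap at their endpoints.

```python
def cohens_kappa(tp, fp, fn, tn):
    """Cohen's kappa from the four cells of a binary confusion matrix."""
    n = tp + fp + fn + tn
    if n == 0:
        raise ValueError("confusion matrix is empty")
    # Observed agreement is plain accuracy.
    observed = (tp + tn) / n
    # Expected agreement by chance, from the row and column marginals.
    predicted_pos = (tp + fp) / n
    actual_pos = (tp + fn) / n
    expected = predicted_pos * actual_pos + (1 - predicted_pos) * (1 - actual_pos)
    if expected == 1.0:
        # Degenerate case: only one class present in both labels and predictions.
        return 0.0
    return (observed - expected) / (1 - expected)


def interpret_kappa(kappa):
    """Map kappa to the agreement bands listed above.

    Assumption: a value on a boundary goes to the higher band.
    """
    if kappa < 0.20:
        return "Poor"
    if kappa < 0.40:
        return "Fair"
    if kappa < 0.60:
        return "Moderate"
    if kappa < 0.80:
        return "Good"
    return "Very good"


# Invented counts for illustration: accuracy is 0.80, chance agreement is 0.50.
kappa = cohens_kappa(tp=40, fp=10, fn=10, tn=40)
print(f"kappa={kappa:.2f} ({interpret_kappa(kappa)} agreement)")
```

In this example the accuracy of 0.80 shrinks to a kappa of about 0.60 because half of the agreement would be expected by chance with balanced classes.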

Allan Kimaina

May 5, 2020